If you were not here for Lab 12 and still need to install the graphviz package, run:

In [None]:
!pip install --user graphviz

# Lab 13 - Decision Trees for regression

For this lab, we will return to the insurance data from Labs 7 and 8.  Recall we are trying to predict the insurance cost, a quantitative value.  

If you don't have the dataset, download it from GitHub: [https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv)

In this data, each row represents an insurance policy and the 7 columns contain the following information about it:
- age: age of policy holder
- sex: sex of policy holder
- bmi: body mass index (BMI) of policy holder.  BMI is a (sometimes unreliable) measurement of body fat in adults
- children: number of children (dependents) on the policy
- smoker: whether the policy holder is a smoker
- region: region of the country the policy holder lives in
- charges: price for insurance policy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree
import graphviz
from graphviz import Source
 
from sklearn.model_selection import train_test_split

from sklearn.tree import export_graphviz
import sklearn.metrics as met
from sklearn.metrics import confusion_matrix

%matplotlib inline

Read the data into a dataframe and display it to make sure it was read in correctly:

scikit-learn decision trees require numeric data.  How can we convert the categorical columns into numeric data?  
Hint:  see Lab 8
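
As a reminder of one way to do this, here is a minimal sketch using `pd.get_dummies`. The rows in `toy` are made up for illustration (they are not from insurance.csv), and `drop_first=True` is an assumption chosen because it produces the column names used later in this lab (`sex_male`, `smoker_yes`, `region_northwest`, ...). Lab 8 may have done this differently.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from insurance.csv.
toy = pd.DataFrame({
    "age": [21, 45, 30, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [24.5, 31.2, 28.0, 22.1],
    "children": [0, 2, 1, 3],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [15000.0, 8000.0, 4000.0, 11000.0],
})

# Each categorical column becomes indicator columns. drop_first=True drops
# the first level of each (female, no, northeast) to avoid redundant columns.
toy_numeric = pd.get_dummies(toy, drop_first=True)
print(list(toy_numeric.columns))
```

The numeric columns keep their original order and the indicator columns are appended at the end, so `charges` ends up at position 3.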

## Fitting a decision tree with scikit-learn

We can get just the independent variables (x's) using the following:

In [None]:
X = insurance.iloc[:,[0,1,2,4,5,6,7,8]]
X.head()

Next we create the decision tree object and then fit it to our data:

In [None]:
reg = tree.DecisionTreeRegressor(max_depth = 5)
reg = reg.fit(X, insurance["charges"])

If you are running Jupyter on your own computer, you may be able to display the decision tree with:

In [None]:
tree.plot_tree(reg)

If you are using the Jupyter Hub server, run the following code (which will give an error):

In [None]:
dot_data = tree.export_graphviz(reg, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("insurance.dot")

However, despite the error, there should now be a file called insurance.dot in your directory.  To view the fitted decision tree, open the insurance.dot file in Jupyter and copy the text.  Paste this text into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) and click the "Generate graph!" button at the bottom.

The column names have been replaced by `X[0], X[1], ..., X[7]`.  Run the following code to change `X[0], X[1], ..., X[7]` to the column names in insurance.dot.

In [None]:
# Replace the generic labels X[0]..X[7] with the column names of X
with open("insurance.dot", "r") as fin:
    with open("insurance_fixed.dot", "w") as fout:
        for line in fin:
            line = line.replace("X[0]", "age")
            line = line.replace("X[1]", "bmi")
            line = line.replace("X[2]", "children")
            line = line.replace("X[3]", "sex_male")
            line = line.replace("X[4]", "smoker_yes")
            line = line.replace("X[5]", "region_northwest")
            line = line.replace("X[6]", "region_southeast")
            line = line.replace("X[7]", "region_southwest")
            fout.write(line)

Copy the contents of insurance_fixed.dot into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) to display the decision tree with the column names.  How does it compare to the decision tree you made?
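
As an alternative to editing the .dot file afterwards, `export_graphviz` accepts a `feature_names` argument. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`, and `reg_demo` are hypothetical names, not part of the lab); with the real data you could pass `list(X.columns)` instead.

```python
import numpy as np
from sklearn import tree

# Synthetic stand-in data; the names mirror the columns of X in this lab.
feature_names = ["age", "bmi", "children", "sex_male", "smoker_yes",
                 "region_northwest", "region_southeast", "region_southwest"]
rng = np.random.default_rng(0)
X_demo = rng.random((50, len(feature_names)))
y_demo = 10 * X_demo[:, 0] + 20 * X_demo[:, 4]

reg_demo = tree.DecisionTreeRegressor(max_depth=2, random_state=0)
reg_demo.fit(X_demo, y_demo)

# feature_names labels each split with a column name instead of an index.
dot_text = tree.export_graphviz(reg_demo, out_file=None,
                                feature_names=feature_names)
print(dot_text[:300])
```

The resulting text can be pasted into webgraphviz directly, with no find-and-replace step.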

What happens if you change the `max_depth` parameter to 5 in DecisionTreeRegressor?

Look at the leaves of your new tree.  What's the smallest sample?  

A few of the leaves only have 1 sample.  How do you think this tree would work on other insurance data?

The single samples are a sign of over-fitting, and to fix it we can make `max_depth` smaller (but too small and our model will not be as good as it could be).
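
Leaf sizes can also be checked in code rather than by reading the graph. This is a rough sketch on synthetic data (all names ending in `_demo` are hypothetical); it relies on the fitted tree's `tree_` attribute, where leaves are the nodes whose `children_left` entry is -1.

```python
import numpy as np
from sklearn import tree

# Synthetic noisy data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)

smallest_leaf = {}
for depth in (2, 8):
    reg_demo = tree.DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg_demo.fit(X_demo, y_demo)
    is_leaf = reg_demo.tree_.children_left == -1      # leaves have no children
    leaf_sizes = reg_demo.tree_.n_node_samples[is_leaf]
    smallest_leaf[depth] = int(leaf_sizes.min())
    print("max_depth", depth, "leaves:", leaf_sizes.size,
          "smallest leaf:", smallest_leaf[depth])
```

Deeper trees split the same data into more leaves, so the smallest leaf can only stay the same size or shrink as `max_depth` grows.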

### Testing and training data

To figure out what `max_depth` should be, let's split our data into training and testing data. 

Create a decision tree with `max_depth = 3` from the training data:

Make predictions for the test data:

Compute the mean squared error for these predictions:
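
For reference, the split, fit, predict, and score steps can be sketched as follows. The data here is synthetic, and `test_size=0.2` and `random_state=0` are assumptions; with the lab data you would use `X` and `insurance["charges"]` in place of `X_demo` and `y_demo`.

```python
import numpy as np
import sklearn.metrics as met
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 4))
y_demo = (5000 * X_demo[:, 0] + 20000 * (X_demo[:, 1] > 0.8)
          + rng.normal(scale=500, size=300))

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Fit on the training rows only, then predict the held-out rows.
reg_demo = tree.DecisionTreeRegressor(max_depth=3, random_state=0)
reg_demo.fit(X_train, y_train)
y_pred = reg_demo.predict(X_test)

# Mean squared error: the average of (actual - predicted)^2 on the test rows.
mse = met.mean_squared_error(y_test, y_pred)
print("test MSE:", mse)
```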

What is the mean squared error if you use `max_depth = 4`?

What is the mean squared error if you use `max_depth = 5`?

What about if you use `max_depth = 2`?

Which `max_depth` parameter should you use?  What is the corresponding decision tree?

You can also use a loop to quickly check the different parameter values for `max_depth`.  
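
One possible shape for such a loop is sketched below on synthetic data. The depth range 1 to 8 and the names `mse_by_depth` and `best_depth` are illustrative choices, not part of the lab.

```python
import numpy as np
import sklearn.metrics as met
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.random((300, 4))
y_demo = (5000 * X_demo[:, 0] + 20000 * (X_demo[:, 1] > 0.8)
          + rng.normal(scale=500, size=300))
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Fit one tree per depth and record its test-set mean squared error.
mse_by_depth = {}
for depth in range(1, 9):
    reg_demo = tree.DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg_demo.fit(X_train, y_train)
    mse_by_depth[depth] = met.mean_squared_error(y_test, reg_demo.predict(X_test))
    print(depth, mse_by_depth[depth])

# The depth with the smallest test error on this particular split.
best_depth = min(mse_by_depth, key=mse_by_depth.get)
print("best max_depth:", best_depth)
```

The best depth can change with a different random split, so it is worth rerunning with another `random_state` before settling on a value.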

In [None]:
dot_data = tree.export_graphviz(reg_depth3, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("insurance_depth3.dot")

In [None]:
# Replace the generic labels X[0]..X[7] with the column names of X
with open("insurance_depth3.dot", "r") as fin:
    with open("insurance_depth3_fixed.dot", "w") as fout:
        for line in fin:
            line = line.replace("X[0]", "age")
            line = line.replace("X[1]", "bmi")
            line = line.replace("X[2]", "children")
            line = line.replace("X[3]", "sex_male")
            line = line.replace("X[4]", "smoker_yes")
            line = line.replace("X[5]", "region_northwest")
            line = line.replace("X[6]", "region_southeast")
            line = line.replace("X[7]", "region_southwest")
            fout.write(line)

Finally, we can compare the mean squared error of the decision tree regressor to the mean squared error of the linear regression from Lab 8, which was also based on a training/testing split of 0.2.  That error was 41142821.67547247 (for my training/testing data; yours may differ).

Which model is better?

Return to the decision tree classifier from last lab.  Which `max_depth` is best?